DataArray
XArray expands on the capabilities of NumPy arrays, providing streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-Java's Common Data Model (CDM).
DataArray
The DataArray
is one of the basic building blocks of XArray. It provides a NumPy ndarray-like object that expands to provide two critical pieces of functionality: named dimensions with coordinate values, and attribute metadata (both demonstrated below).
In [ ]:
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr
In [ ]:
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data
Here we create a basic DataArray
by passing it just a numpy array of random data. Note that XArray generates some basic dimension names for us.
In [ ]:
temp = xr.DataArray(data)
temp
We can also pass in our own dimension names:
In [ ]:
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp
This is already an improvement over a plain NumPy array, because we have names for each of the dimensions (or axes, in NumPy parlance). Even better, we can take arrays representing the coordinate values for each of these dimensions and associate them with the data when we create the DataArray
.
In [ ]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times
In [ ]:
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)
When we create the DataArray
instance, we pass in the arrays we just created:
In [ ]:
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp
...and we can also set some attribute metadata:
In [ ]:
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'
temp
Notice what happens if we perform a mathematical operation with the DataArray
: the coordinate values persist, but the attributes are lost. This is because it is very challenging to know whether the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.
In [ ]:
# For example, convert Kelvin to Celsius
temp - 273.15
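If you need attributes to survive arithmetic, xarray's set_options supports a keep_attrs flag; a minimal sketch (note the units string becomes misleading after the conversion, which is exactly why the default drops it):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(283 + np.zeros(3), dims=['x'], attrs={'units': 'kelvin'})

# Default behavior: arithmetic drops the attributes
assert (da - 273.15).attrs == {}

# With keep_attrs=True, attributes are carried through unchanged
with xr.set_options(keep_attrs=True):
    celsius = da - 273.15
print(celsius.attrs)  # still says 'kelvin', even though values are now Celsius
```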
We can select data along a dimension using the .sel method with coordinate values:
In [ ]:
temp.sel(time='2018-01-02')
.sel
has the flexibility to also perform nearest neighbor sampling, taking an optional tolerance:
In [ ]:
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
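If no coordinate value falls within the tolerance, .sel raises a KeyError; a small sketch with synthetic data:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.arange(5.0), dims=['time'],
                  coords={'time': pd.date_range('2018-01-01', periods=5)})

# 2018-01-07 is within 2 days of the last time (2018-01-05)
nearest = da.sel(time='2018-01-07', method='nearest',
                 tolerance=pd.Timedelta(days=2))
print(nearest.values)  # 4.0, the value at 2018-01-05

# Far outside the tolerance: raises KeyError
try:
    da.sel(time='2018-01-20', method='nearest', tolerance=pd.Timedelta(days=2))
except KeyError:
    print('no time within tolerance')
```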
.interp()
works similarly to .sel()
. Using .interp()
, get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp).
In [ ]:
# Your code goes here
In [ ]:
# %load solutions/interp_solution.py
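As a generic illustration (separate from the solution file), .interp performs linear interpolation between coordinate values; note that it requires scipy to be installed:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([0.0, 10.0, 20.0]), dims=['lat'],
                  coords={'lat': [30.0, 40.0, 50.0]})

# Linear interpolation halfway between lat=30 and lat=40
print(da.interp(lat=35.0).values)  # 5.0
```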
We can also select ranges of data by passing slice objects to .sel:
In [ ]:
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))
.loc provides pandas-style label-based indexing, relying on the order of the dimensions:
In [ ]:
# As done above
temp.loc['2018-01-02']
In [ ]:
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]
In [ ]:
# This *doesn't* work, however, because .loc depends on dimension order:
#temp.loc[-110:-70, 25:45, '2018-01-01':'2018-01-03']
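Since .loc depends on dimension order, keyword-based .sel is the order-independent alternative; a sketch reproducing the selection above:

```python
import numpy as np
import pandas as pd
import xarray as xr

temp = xr.DataArray(283 + np.zeros((5, 3, 4)),
                    coords=[pd.date_range('2018-01-01', periods=5),
                            np.linspace(25, 55, 3),
                            np.linspace(-120, -60, 4)],
                    dims=['time', 'lat', 'lon'])

# Keyword arguments may appear in any order
sub = temp.sel(lon=slice(-110, -70), lat=slice(25, 45),
               time=slice('2018-01-01', '2018-01-03'))
print(dict(sub.sizes))  # {'time': 3, 'lat': 2, 'lon': 2}
```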
In [ ]:
# Open sample North American Reanalysis data in netCDF format
ds = xr.open_dataset('../../data/NARR_19930313_0000.nc')
ds
This returns a Dataset
object, which is a container for one or more DataArray
s that can optionally share coordinates. We can then pull out individual fields:
In [ ]:
ds.isobaric1
or
In [ ]:
ds['isobaric1']
Dataset
s also support many of the same subsetting operations as DataArray
, but perform the operation on all data variables:
In [ ]:
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000
In [ ]:
ds_1000.Temperature_isobaric
Aggregation operations can also be performed across one or more dimensions:
In [ ]:
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])
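Other reductions like mean, min, and max follow the same pattern; a sketch with synthetic data:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(24.0).reshape(2, 3, 4), dims=['z', 'y', 'x'])

# Reduce over the horizontal dimensions, leaving a profile along z
profile = da.mean(dim=['x', 'y'])
print(profile.values)  # [ 5.5 17.5]
```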
Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:
(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)
In [ ]:
# %load solutions/mean_profile.py
There is much more in the XArray library. To learn more, visit the XArray Documentation
To better enable reproducible data and research, the Climate and Forecasting (CF) metadata conventions were created to ensure proper metadata in atmospheric data files. In the remainder of this notebook, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover gridded data here, with more in-depth examples provided in the full CF notebook. Xarray makes the creation of netCDF files with proper metadata simple and straightforward, so we will use it instead of the netCDF4-python library.
This assumes a basic understanding of netCDF.
Let's say we're working with some numerical weather forecast model output. We'll walk through the steps necessary to store this data in netCDF, using the Climate and Forecasting metadata conventions to ensure that our data are available to as many tools as possible.
To start, let's assume the following about our data: hourly output over twelve hours, on a grid with 3 km spacing in x and y, at standard pressure levels.
We'll also go ahead and generate some arrays of data below to get started:
In [ ]:
# Import some useful Python tools
from datetime import datetime, timedelta
# Twelve hours of hourly output starting at 22Z today
start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])
# 3km spacing in x and y
x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)
# Standard pressure levels in hPa
press = np.array([1000, 925, 850, 700, 500, 300, 250])
temps = np.random.randn(times.size, press.size, y.size, x.size)
Time coordinates must contain a units
attribute with a string value with a form similar to 'seconds since 2019-01-06 12:00:00.00'
. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time. Due to the variable length of months and years, they are not recommended.
Before we can write data, we first need to convert our list of Python datetime
instances to numeric values. We can use the cftime
library to perform this conversion using the unit string defined above.
In [ ]:
from cftime import date2num
time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])
time_vals = date2num(times, time_units)
time_vals
Now we can create the Dataset, including the forecast_time
coordinate alongside the other coordinate variables:
In [ ]:
ds = xr.Dataset({'temperature': (['time', 'z', 'y', 'x'], temps, {'units':'Kelvin'})},
coords={'x_dist': (['x'], x, {'units':'km'}),
'y_dist': (['y'], y, {'units':'km'}),
'pressure': (['z'], press, {'units':'hPa'}),
'forecast_time': (['time'], times)
})
ds
Due to how xarray handles time units, we need to encode the units in the forecast_time
coordinate.
In [ ]:
ds.forecast_time.encoding['units'] = time_units
If we look at our data variable, we can see the units printed out, so they were attached properly!
In [ ]:
ds.temperature
We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it.
In [ ]:
ds.attrs['Conventions'] = 'CF-1.7'
ds.attrs['title'] = 'Forecast model run'
ds.attrs['institution'] = 'Unidata'
ds.attrs['source'] = 'WRF-1.5'
ds.attrs['history'] = str(datetime.utcnow()) + ' Python'
ds.attrs['references'] = ''
ds.attrs['comment'] = ''
ds
We can also add attributes to this variable to define metadata. The CF conventions require a units
attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we have already set it to a value of 'Kelvin'
. We also set the standard (optional) attributes of long_name
and standard_name
. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value
attribute to an appropriate value.
NASA Dataset Interoperability Recommendations:
Section 2.2 - Include Basic CF Attributes
Include where applicable: units, long_name, standard_name, valid_min/valid_max, scale_factor/add_offset, and others.
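To illustrate how scale_factor/add_offset packing works (the values here are hypothetical; a CF-aware reader computes unpacked = packed * scale_factor + add_offset):

```python
import numpy as np

# Hypothetical packing: store Kelvin temperatures as int16
scale_factor, add_offset = 0.01, 273.15
unpacked = np.array([283.15, 293.15])
packed = np.round((unpacked - add_offset) / scale_factor).astype(np.int16)

# A CF-aware reader reverses the transformation
recovered = packed * scale_factor + add_offset
print(packed.tolist())  # [1000, 2000]
```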
In [ ]:
ds.temperature.attrs['standard_name'] = 'air_temperature'
ds.temperature.attrs['long_name'] = 'Forecast air temperature'
ds.temperature.attrs['missing_value'] = -9999
ds.temperature
To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions, which are called "Coordinate Variables". Generally, these are defined by creating a one dimensional variable with the same name as the respective dimension.
To start, we define variables which define our x
and y
coordinate values. These variables include standard_name
s which allow associating them with projections (more on this later) as well as an optional axis
attribute to make clear what standard direction this coordinate refers to.
In [ ]:
ds.x.attrs['axis'] = 'X' # Optional
ds.x.attrs['standard_name'] = 'projection_x_coordinate'
ds.x.attrs['long_name'] = 'x-coordinate in projected coordinate system'
ds.y.attrs['axis'] = 'Y' # Optional
ds.y.attrs['standard_name'] = 'projection_y_coordinate'
ds.y.attrs['long_name'] = 'y-coordinate in projected coordinate system'
We also define a coordinate variable pressure
to reference our data in the vertical dimension. The standard_name
of 'air_pressure'
is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the axis
as well. We also specify the attribute positive
to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional.
In [ ]:
ds.pressure.attrs['axis'] = 'Z' # Optional
ds.pressure.attrs['standard_name'] = 'air_pressure'
ds.pressure.attrs['positive'] = 'down' # Optional
Similarly, we set attributes on the forecast_time coordinate:
In [ ]:
ds.forecast_time.attrs['axis'] = 'T' # Optional
ds.forecast_time.attrs['standard_name'] = 'time' # Optional
ds.forecast_time.attrs['long_name'] = 'time'
Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called "auxiliary coordinate variables", not because they are extra, but because they are not simple one-dimensional variables.
Below, we first generate longitude and latitude values from our projected coordinates using the pyproj
library.
In [ ]:
from pyproj import Proj
X, Y = np.meshgrid(x, y)
lcc = Proj({'proj':'lcc', 'lon_0':-105, 'lat_0':40, 'a':6371000.,
'lat_1':25})
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)
Now we can create the needed variables. Both are dimensioned on y
and x
and are two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east'
, as well as the optional 'longitude'
standard_name
attribute. The case is the same for latitude, except the units are 'degrees_north'
and the standard_name
is 'latitude'
.
In [ ]:
ds = ds.assign_coords(lon=(['y', 'x'], lon), lat=(['y', 'x'], lat))
ds
In [ ]:
ds.lon.attrs['units'] = 'degrees_east'
ds.lon.attrs['standard_name'] = 'longitude' # Optional
ds.lon.attrs['long_name'] = 'longitude'
ds.lat.attrs['units'] = 'degrees_north'
ds.lat.attrs['standard_name'] = 'latitude' # Optional
ds.lat.attrs['long_name'] = 'latitude'
With the variables created and assigned as coordinates, xarray identifies them as containing coordinates for the temperature
variable; when the file is written, the coordinates
attribute will be set automatically to a space-separated list of the names of the auxiliary coordinate variables:
In [ ]:
ds
With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a "grid mapping" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping
attribute.
Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name
attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.
In [ ]:
ds['lambert_projection'] = int()
ds.lambert_projection.attrs['grid_mapping_name'] = 'lambert_conformal_conic'
ds.lambert_projection.attrs['standard_parallel'] = 25.
ds.lambert_projection.attrs['latitude_of_projection_origin'] = 40.
ds.lambert_projection.attrs['longitude_of_central_meridian'] = -105.
ds.lambert_projection.attrs['semi_major_axis'] = 6371000.0
ds.lambert_projection
Now that we have created the variable, all that's left is to set the grid_mapping
attribute on our temperature
variable to the name of our dummy variable:
In [ ]:
ds.temperature.attrs['grid_mapping'] = 'lambert_projection'
ds
Xarray has built-in support for a few flavors of netCDF. Here we'll write a netCDF4 file from our Dataset.
In [ ]:
ds.to_netcdf('test_netcdf.nc', format='NETCDF4')
In [ ]:
!ncdump test_netcdf.nc